This pipeline serves to interpret DESeq2 output of mRNA DGE Experiments using a PCA plot, MA Plots for different contrasts, an overall counts heatmap, replicate scatter plots within each condition, and mean scatterplots for different contrasts that highlight statistically significant observations. Required packages are DESeq2, tximport, knitr, and heatmaply. Within the working directory, seqeuncing results files (already filtered, trimmed, and mapped to a reference genome) will need to be present in addition to a csv called ‘samples.csv’ with columns in the following format: “files”, “replicates”, “condition”, and “control_condition”. The first column is a list of the file names to be imported from the working directory. The second column consists of the treatment type and replicate number associated with each corresponding results file (for example, ‘WT_1’). The third column consists of the generalized treatment type/condition for each file without replicate numbers (for example, simply ‘WT’), and the fourth column is a logical vector (TRUE/FALSE) establishing whether or not each condition/file is a control group condition/file. The first file listed for each condition will be used to establish the control group status from the fourth column. The pipeline is generalized to a variety of experimental types. If one control group is listed, all other condition groups will be compared against the control group, with MA Plots and mean reads scatterplots drawn for each contrast between control and experimental groups. If multiple control groups are listed, or if no control group is listed, MA Plots and mean reads scatterplots will be drawn for all possible contrasts between different conditions.
The samples table is generated from the samples.csv file in the working directory. A list of files named by their replicate number and type is created for the tximport function. The ‘control_condition’ object summarizes each unique condition as a control or experimental group according to a named logical vector. The DESeqDataSetFromTximport function requires a table consisting of the condition column with each row named according to the associated replicated number and type. This table is named ‘sampleTable’. After tximport is run to interpret the results files, the dds object compiles the imported files and runs them through DESeq2. An overall counts table is built from dds and saved in the working directory as a csv file with a preceding date stamp. Next, the list of intended contrasts is stored in the ‘Combinations’ object, which is built from the information in ‘control_condition’. ‘Combinations’ is used for the MA Plots and mean reads scatterplots. For each contrast, a counts table csv is made and saved in the working directory with a preceding date stamp. These counts tables include the counts for each replicate, the fold change between contrasted conditions, and the adjusted P-Value of the fold change. In addition, if a csv file called “Gene_Table.csv” is present in the working directory, with the csv consisting of a list of GeneIDs in one column and GeneIDs or common gene names (if available) in the second column, then the counts tables will be built with a column including the common names (if available; GeneID used if common name is not available). If “Gene_Table.csv” is not present in the working directory, the counts tables will include repeated GeneID columns. After the counts table csv files are made, a variance-stablizing transformation is performed for the PCA plot. Then, the counts table is log2 transformed and put into the object ‘cts_heat’, and rows with an average count below 3 (post-transformation) are removed from ‘cts_heat’ for heatmap preparation. This table is clustered by row using euclidean distance and complete clustering. Finally, a table is generated compiling the log2 of average counts by gene across all replicates for each condition. Infinite values are replaced with -4. This counts table is used for the mean reads scatterplots.
The variance stabilizing transform data is used to create a plot for the first two principal components. The output is a PDF file labeled with a preceding date stamp.
Two MA plots are created for each intended contrast/combination based on the samples.csv file. The first MA Plot draws from the DESeq results file for the appropriate contrast, and the second MA plot transforms the general DESeq object with log2 fold change shrinkage before plotting. Output files are labeled by contrast with a preceding date stamp.
The object ‘cts_heat’, which consists of all log2 counts from ‘cts’ that are above a row threshold of 3 (which translates to an average read number of 8 across all samples for a given gene), is used to create an interactive heatmap for the counts table. This heatmap is clustered by row (prior to the plotting call) using euclidean distance and a complete clustering method. Darker blue cells correspond to higher log2 counts. The heatmap includes hover text describing each cell, as well as zooming and panning features for analyzing smaller groups of cells. The output is an interactive HTML widget.